The Bible, truth, and multilingual OCR evaluation
Abstract
Multilingual OCR has emerged as an important information technology, thanks to the increasing need for cross-language information access. While many research groups and companies have developed OCR algorithms for various languages, it is difficult to compare the performance of these OCR algorithms across languages. This difficulty arises because most evaluation methodologies rely on the use of a document image dataset in each of these languages, and it is difficult to find document datasets in different languages that are similar in content, layout, and fonts. In this paper we propose to use the Bible as a dataset for comparing OCR accuracy across languages. Besides being available in a wide range of languages, Bible translations are closely parallel in content, carefully translated, surprisingly relevant with respect to modern-day language, and quite inexpensive. A project at the University of Maryland is currently implementing this idea. We have created a scanned image dataset with groundtruth from an Arabic Bible. We have also used image degradation models to create synthetically degraded images of a French Bible. We hope to generate similar Bible datasets for other languages, and we are exploring alternative corpora with similar properties, such as the Koran and the Bhagavad Gita. Quantitative OCR evaluation based on the Arabic Bible dataset is currently in progress.
Similar resources
The Bible, Truth, and Multilingual OCR
Multilingual OCR has emerged as an important information technology, thanks to the increasing need for cross-language information access. While many research groups and companies have developed OCR algorithms for various languages, it is difficult to compare the performance of these OCR algorithms across languages. This difficulty arises because most evaluation methodologies rely on the use of a do...
Evaluating SEE - A Benchmarking System for Document Page Segmentation
The decomposition of a document into segments such as text regions and graphics is a significant part of the document analysis process. The basic requirement for rating and improvement of page segmentation algorithms is systematic evaluation. The approaches known from the literature have the disadvantage that manually generated reference data (zoning ground truth) are needed for the evaluation ...
Title of Thesis: GROUNDTRUTH GENERATION AND DOCUMENT IMAGE DEGRADATION
Title of Thesis: GROUNDTRUTH GENERATION AND DOCUMENT IMAGE DEGRADATION Gang Zi, Master of Science, 2005 Thesis Directed By: Professor Rama Chellappa Department of Electrical and Computer Engineering University of Maryland at College Park The problem of generating synthetic data for the training and evaluation of document analysis systems has been widely addressed in recent years. With the incre...
Reducing OCR Errors by Combining Two OCR Systems
This paper describes our efforts in building a heritage corpus of Alpine texts. We have already digitized the yearbooks of the Swiss Alpine Club from 1864 until 1982. This corpus poses special challenges since the yearbooks are multilingual and vary in orthography and layout. We discuss methods to improve OCR performance and experiment with combining two different OCR programs with the goal to ...
An OCR Free Method for Word Spotting in Printed Documents: the Evaluation of Different Feature Sets
An OCR-free word spotting method is developed and evaluated under a strong experimental protocol. Different feature sets are evaluated under the same experimental conditions. In addition, a tuning process in the document segmentation step is proposed which provides a significant reduction in terms of processing time. For this purpose, a complete OCR-free method for word spotting in printed docum...
Publication date: 1999